Applied Intelligence manuscript No.
(will be inserted by the editor)
Grammatical facial expression recognition using
customized deep neural network architecture
Devesh Walawalkar
Received: date / Accepted: date
Abstract This paper proposes to expand the visual understanding capacity of
computers by helping them recognize human sign language more efficiently. This is
carried out through the recognition of the facial expressions that accompany the
hand signs used in the language. The paper focuses specifically on the popular
Brazilian sign language (LIBRAS). While much literature has been dedicated to
classifying different hand signs into their respective word meanings, the emotion
or intention with which the words are expressed has largely been left out of
consideration. As everyday human experience shows, words expressed with different
emotions or moods can carry completely different meanings. Lending computers the
ability to classify these facial expressions adds another level of understanding
of what a deaf person actually wants to communicate. The proposed idea is
implemented through a deep neural network with a customized architecture, which
learns the specific patterns of individual expressions much better than a generic
approach does. With an overall accuracy of 98.04%, the implemented deep network
performs excellently and is thus fit for use in practical scenarios.
Keywords Computer Vision · Grammatical facial expression recognition ·
Brazilian sign language classification · Customized deep neural network
1 Introduction
Sign language is an essential medium used by deaf people to communicate with
other people in their environment. As sign language does not have a speech
component, through which an average speaker conveys the emotion behind what he or
she says, facial expressions assume this important role in sign language.
Devesh Walawalkar
Bachelor of Technology in Electronics from V.J.T.I., Mumbai
1005, 11th floor, Hrishikesh Apts.,
Veer Savarkar Marg, Dadar (W), Mumbai, India - 400028
Tel.: +91 9820143154
E-mail: devwalkar64@gmail.com
ORCID: 0000-0001-9464-9027
A computer trained to understand the language through hand gestures alone would
fail to grasp the semantic and structural context of what the person is trying to
convey. Much of the literature on this topic [1,3,4,8,9,12,16-18] has focused on
sign language recognition through hand gestures only, without considering the
facial expression aspect. Combining the classification of facial expressions with
that of hand gestures results in a more efficient interpretation [14,19].
These facial expressions are called 'Grammatical Facial Expressions' (GFEs)
because they help resolve semantic-level ambiguity in human sign language. Facial
expression recognition has attracted attention over recent years because it is
useful in many applications, such as systems that convert recorded sign language
into natural-language text, or the subtitling of videos in which sign language is
used. Neural network techniques are employed here because they are very efficient
at learning complex functions when given enough training data. Previous work on
GFE classification [2] is based on traditional classification methods and thus
fails to leverage the potential of recent deep learning developments. This paper
compares the proposed model's performance with the results reported by Freitas et
al. [10], with both models computed on the same dataset. Performance comparisons
with a generic fully connected neural network are also presented.
The paper is structured as follows: 1] demonstration of the fundamental classes
(markers) into which the wide variety of GFEs can be classified; 2] the
incorporated dataset and its detailed description; 3] implementation of the
customized deep neural network architecture; 4] network initialization and
hyperparameter tuning; 5] the cost function and optimization algorithm used;
6] binary and multiclass classification performance results; 7] comparison of
binary classification performance with that of an accepted method from the
literature; 8] final discussion of the achieved results and their implications.
2 Importance of grammatical facial expressions
Sign language consists of two main components: manual and non-manual [5].
The manual components consist of hand shape, palm orientation and arm movement.
The non-manual components consist of facial expressions, pose and mouth movement.
Some signs can be distinguished from their manual components alone, while the
rest need the additional non-manual components to be distinguished. The Brazilian
sign language system contains certain words that have nearly identical hand sign
formation; they differ from each other only in the facial expression with which
they are signed. Hence, sign language recognition through manual cues alone leads
to inefficient and ambiguous classification.
Facial expressions play a vital role in effectively communicating information to
the listener. In a written language such as English, the exclamation mark, the
question mark, the comma and so on convey the emotion the source attaches to a
sentence. Just as moving a comma to a different position in the same sentence can
completely change its meaning, the change or absence of GFEs in a sign language
can completely change the meaning of what is signed.
3 Database Collection
This paper is based upon empirical results computed on the 'Grammatical Facial
Expressions' dataset created by Freitas et al. [10] and obtained under public
license from the University of California, Irvine machine learning repository
[15]. The dataset is based upon facial expressions made by a sign language
performer (referred to below as the user), captured through individual video
frames. There are eight fundamental types of grammatical markers in the Brazilian
sign language (Libras) system, as stated by Brito [6] and de Quadros et al. [7].
These are as follows, along with their meanings:
Wh question: generally used for questions with Who, What, When, Where,
How and Why;
Yes/no question: used when asking a question to which there is a 'yes' or
'no' answer;
Doubt question: not a 'true' question, since an answer is not expected;
it is used to emphasize the information that will be supplied;
Topic: used when one of the sentence's constituents is displaced to the
beginning of the sentence;
Negative: used in negative sentences;
Assertion: used when making assertions;
Conditional: used in a subordinate sentence to indicate a prerequisite to the
main sentence;
Focus: used to highlight new information in the speech pattern.
The dataset consists of 225 videos recorded in five different recording sessions
with the user. In each session, one performance of each sentence was recorded.
The user was asked to perform sentences of each of the above types (with an
additional Relative marker type, which is used at the start of a clause in the
sentence). Examples of these markers, as they appear individually in sentences
in everyday English, are given below [10]:
Conditional:
1] If you miss, you lose.
2] If it’s sunny, I go to the beach.
Assertion:
1] I bought that!
2] I work there!
Negative:
1] I never have been in jail!
2] I didn’t do anything!
Relative:
1] That enterprise? ... Its business is technology!
2] The girl who fell from bike? ... She is in the hospital!
Focus:
1] The bike is BROKEN.
2] It was WAYNE who did that!
Topics:
1] I have a notebook!
2] Fruits ... I like pineapple!
Doubt questions:
1] Did you GRADUATE?
2] Did Wayne buy A CAR?
Wh-questions:
1] What is this?
2] Where do you live?
Yes/no questions:
1] Did he go away?
2] Is this yours?
Multiple frames were captured from each of these marker videos and predefined
attribute face points (Figure 1) were located in each frame. The X, Y (frontal
image plane) and Z (depth) coordinates of each of these 100 attribute points were
recorded for each frame using a Microsoft Kinect™ sensor.
The frames were then hand-classified, as a binary classification task for each
individual class, with the help of a sign language expert. This procedure was
carried out for two users (A and B) so as to reduce any particular user bias in
the acquired data. The dataset contains 27965 frames in total, classified into 18
different classes (9 for each user). A description of the dataset's constituents
is given in Table 1.
Table 1: Number of positive and negative samples for each GFE class

GFE class type     Positive samples    Negative samples
Assertion          541                 644
Yes/no question    734                 841
Negative           568                 596
Topic              510                 1789
Conditional        448                 1486
Doubt question     1100                421
Focus              446                 863
Relative           981                 1682
Wh question        643                 962
The implemented test set comprises 30% of the total available dataset. This
gives a sample (frame) count of 400 - 450 samples per class (for binary
classification) for each user.
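The paper does not specify the splitting procedure beyond the 30% figure. The
following sketch shows one plausible way to reproduce such a split; scikit-learn
and the stratification choice are assumptions, not taken from the paper:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: a (n_frames, 300) coordinate matrix for one marker's
# binary dataset (100 points x 3 coordinates) and 0/1 frame labels,
# sized like the Assertion class in Table 1 (541 + 644 frames).
rng = np.random.default_rng(0)
X = rng.normal(size=(1185, 300))
y = np.concatenate([np.ones(541), np.zeros(644)]).astype(int)

# A 30% stratified hold-out keeps the positive/negative ratio of
# Table 1 in both the training and the test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)
```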
Fig. 1: Attribute point locations on User face
4 Data pre-processing
For each attribute point, the (X, Y) coordinates are given in pixels, whereas the
Z coordinate is given in mm. Since the units, and hence the numerical ranges,
differ, Z-score standardization is performed on the dataset before it is used in
experimentation. This also makes the learnt model invariant to the location of
the face in the captured frame (i.e. to a shifted set of attribute values).
Some isolated coordinate values missing from the dataset are represented by a
placeholder value of '0.0'. Such arbitrary values could mislead model learning.
Hence, these values are replaced by the mean of that particular attribute
coordinate (X, Y or Z) over the remaining samples in the dataset. This
modification was supported by an improvement in model performance.
Each marker's binary classification dataset contains an unequal number of
positive and negative cases, the negative ones on average being much more
numerous than the positive ones (refer to Table 1). This might bias the model
slightly towards learning the negative pattern. Hence, for training, an equal
number of samples from both classes is used, which leads to an appreciable
increase in model performance.
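As a concrete illustration of the imputation and standardization steps above,
here is a minimal NumPy sketch; the helper name and array layout are assumptions,
since the paper provides no code:

```python
import numpy as np

def preprocess(raw):
    """Replace 0.0 missing-value placeholders by per-column means,
    then Z-score standardize every coordinate column.

    raw: (n_frames, 300) array, 100 attribute points x (x, y, z).
    Hypothetical helper illustrating the paper's two steps."""
    X = raw.astype(np.float64).copy()
    # 1) Impute: a 0.0 entry is a placeholder, so replace it with the
    #    mean of that coordinate over the non-missing samples.
    for j in range(X.shape[1]):
        col = X[:, j]                    # view into X, edits propagate
        missing = col == 0.0
        if missing.any() and not missing.all():
            col[missing] = col[~missing].mean()
    # 2) Standardize: zero mean, unit variance per coordinate, which
    #    removes the pixel-vs-millimetre scale mismatch.
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0.0] = 1.0                # guard against constant columns
    return (X - mean) / std
```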
Fig. 2: Customized deep neural network architecture
(* standardized input data using the Z-score method)
5 Deep neural network architecture
For this model, a customized feed-forward deep network architecture was
implemented. It consists of two hidden layers along with the standard input and
output layers. The entire customized architecture is shown in Figure 2. For each
sample (frame), each attribute point's standardized X, Y, Z coordinates are fed
to a single dedicated neuron in the first hidden layer. The 100 neurons of the
first hidden layer are thus each tuned to find patterns in their own attribute
point's coordinates. The space represented by the first hidden layer can be
expressed as

$$H_1 = \{V_0, V_1, V_2, \ldots, V_n, \ldots, V_{99}\}, \qquad V_n = (X_n, Y_n, Z_n) \tag{1}$$
Table 2: Attribute point groups for different user face regions

Face region                      Attribute point range
Left eye                         0-7
Right eye                        8-15
Left eyebrow                     16-25
Right eyebrow                    26-35
Nose                             36-47
Mouth                            48-67
Face contour                     68-86
Left & right iris + nose tip     87-89
Line above left eyebrow          90-94
Line above right eyebrow         95-99
Subsequently, distinct clusters of these neurons feed specific neurons in the
second hidden layer. As seen in Figure 1, certain clusters of attribute points
(i.e. first-layer neurons) represent specific parts of the human face; these
clusters are listed in Table 2. Each second-layer neuron is thus tuned to learn
the individual patterns of one specific face region, such as the left/right eye,
nose, mouth and so on.
This hidden layer space can be represented as

$$H_2 = \{H_{1,0}, H_{1,1}, H_{1,2}, \ldots, H_{1,n}, \ldots, H_{1,9}\} \tag{2}$$

where the $H_{1,n}$ are the attribute-point clusters $\{V_0 \ldots V_7\}$,
$\{V_8 \ldots V_{15}\}$, $\{V_{16} \ldots V_{25}\}$, $\{V_{26} \ldots V_{35}\}$,
$\{V_{36} \ldots V_{47}\}$, $\{V_{48} \ldots V_{67}\}$, $\{V_{68} \ldots V_{86}\}$,
$\{V_{87} \ldots V_{89}\}$, $\{V_{90} \ldots V_{94}\}$ and
$\{V_{95} \ldots V_{99}\}$ of Table 2.

The output layer consists of two neurons for the binary classification task and
of three to nine neurons for the multiclass classification tasks. The second
hidden layer is fully connected to every output neuron, which enables the
output-layer neurons to learn patterns from all of the face regions present in
the $H_2$ space. Each neuron in the architecture has an individual bias weight
attached to it.
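The paper does not name its implementation framework; the following PyTorch
sketch is only an illustration of the connectivity just described (100 per-point
neurons in the first hidden layer with three inputs each, ten per-region neurons
wired according to Table 2, a fully connected output layer, and the tanh
activation chosen in Section 6):

```python
import torch
import torch.nn as nn

# Attribute-point index ranges per face region (Table 2),
# given as half-open [lo, hi) intervals.
REGIONS = [(0, 8), (8, 16), (16, 26), (26, 36), (36, 48),
           (48, 68), (68, 87), (87, 90), (90, 95), (95, 100)]

class CustomizedGFENet(nn.Module):
    """Sketch of the customized architecture of Figure 2."""
    def __init__(self, n_classes=2):
        super().__init__()
        # H1: one neuron per attribute point, each connected only
        # to that point's standardized (x, y, z) coordinates.
        self.w1 = nn.Parameter(torch.empty(100, 3))
        self.b1 = nn.Parameter(torch.zeros(100))
        nn.init.xavier_uniform_(self.w1)
        # H2: one neuron per face region, each connected only to
        # the H1 neurons of its own region.
        self.region_units = nn.ModuleList(
            [nn.Linear(hi - lo, 1) for lo, hi in REGIONS])
        # Output layer, fully connected to all ten region neurons.
        self.out = nn.Linear(len(REGIONS), n_classes)

    def forward(self, x):                      # x: (batch, 300)
        pts = x.view(-1, 100, 3)               # (batch, 100, 3)
        h1 = torch.tanh((pts * self.w1).sum(dim=-1) + self.b1)
        h2 = torch.tanh(torch.cat(
            [unit(h1[:, lo:hi])
             for (lo, hi), unit in zip(REGIONS, self.region_units)],
            dim=1))
        return self.out(h2)    # logits; softmax is applied in the loss
```

The per-point and per-region weights could equally be realised as masked dense
layers; the grouped parameters above simply make the sparsity of the customized
architecture explicit.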
6 Initialization and hyperparameter tuning
The hyperparameters of the network are optimized by comparing the performance of
models trained with different hyperparameter values. The optimized values used
for model training are shown in Table 3. The 'tanh' activation function is
preferred over other functions owing to its better performance for this model.
'Softmax with cross entropy' is used as the activation for the output layer
neurons. The weights of the entire network, and their biases, are initialized
using the Xavier initialization method [11]. Empirically, this initialization is
found to perform better than random initialization for this model, helping the
cost function start closer to its global minimum.
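For reference, Xavier initialization [11] draws each weight from a zero-mean
distribution whose scale depends on the neuron's fan-in and fan-out; a minimal
sketch of the uniform variant (the helper name is hypothetical):

```python
import math
import torch

def xavier_uniform(n_in, n_out):
    """Glorot/Xavier uniform initialization [11]: draw weights from
    U(-a, a) with a = sqrt(6 / (n_in + n_out)), which keeps the
    variance of activations roughly constant across layers."""
    a = math.sqrt(6.0 / (n_in + n_out))
    return torch.empty(n_out, n_in).uniform_(-a, a)
```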
Table 3: Model hyperparameters with optimized values

Hyperparameter               Optimized value
Initial learning rate        0.01
Learning rate decay ratio    0.9
Rate decay step              7000
Regularization beta          0.05
Epochs                       750
7 Network training
The 'Mean Squared Error' (MSE) function is used as the cost function for this
model, expressed as the difference between the model's predictions and the true
output values. For training, the 'Adam' optimization algorithm [13] was used
owing to its faster convergence, computational efficiency and lower dependence on
hyperparameter tuning. The learning rate is decayed exponentially by a ratio of
0.9 every 7000 iteration steps. This lets the cost function approach its global
minimum quickly at the larger initial learning rate, without overshooting the
minimum once it is close. To increase generalization capacity and avoid
overfitting of the model, l2-norm regularization is added to the cost function,
scaled by the hyperparameter 'Regularization beta' to control its contribution.
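Putting Table 3 and this section together, one training step might look as
follows. This is a sketch under two assumptions the paper does not pin down:
that the MSE is taken between softmax outputs and one-hot targets, and that the
l2 penalty is summed over all parameters:

```python
import torch

model = CustomizedGFENet(n_classes=2)          # sketch from Sect. 5
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# Exponential decay: multiply the learning rate by 0.9 every
# 7000 iteration steps (Table 3).
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer, step_size=7000, gamma=0.9)
mse = torch.nn.MSELoss()
beta = 0.05                                    # regularization beta

def train_step(x, y_onehot):
    optimizer.zero_grad()
    probs = torch.softmax(model(x), dim=1)
    # MSE cost plus the beta-scaled l2-norm regularization term.
    l2 = sum((p ** 2).sum() for p in model.parameters())
    loss = mse(probs, y_onehot) + beta * l2
    loss.backward()
    optimizer.step()
    scheduler.step()                           # per-iteration decay
    return loss.item()
```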
8 Multiclass classification
To train the network to distinguish between multiple markers, four models
classifying different numbers of markers are implemented. For training and
testing these models, the positive samples from each marker class in the
incorporated dataset are combined into separate data subsets. This amounts to
about 200 - 225 samples per marker class per user. The computed models are as
follows:
8.1 Three-class classification
For this model, various three-class combinations of the nine available classes
were implemented. The architecture modification consists of three neurons in the
output layer, activated by the 'softmax' function. The rest of the network
hyperparameters were kept the same as in the binary classification case. Marker
combinations were selected separately for User A and User B. Model accuracy
results are shown in Table 4.
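Under the Section 5 sketch, this and the following multiclass variants amount to
changing only the output width, everything else remaining as in the binary case:

```python
# Only the output layer widens; all other layers are unchanged.
three_class_model = CustomizedGFENet(n_classes=3)
five_class_model = CustomizedGFENet(n_classes=5)
```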
8.2 Five-class classification
For this model, various five-class combinations of the nine available classes
were implemented. The architecture modification consists of five neurons in the
output layer, activated by the 'softmax' function. Network hyperparameters were
Table 4: Multiclass classification accuracy for three classes

Class combinations^a    Test set accuracy (%)
                        User A    User B
A,YN,R                  97.32     97.89
A,R,F                   97.79     97.65
A,R,F                   97.54     97.31
A,F,T                   97.12     96.97
A,F,C                   96.43     96.12
A,W,D                   97.57     97.78
YN,R,N                  98.42     98.10
N,D,T                   97.15     97.23
Mean                    97.42     97.38

^a A: Assertion, YN: Yes/no question, R: Relative, F: Focus, T: Topic,
C: Conditional, D: Doubt question, W: Wh question, N: Negative
kept the same as in the binary classification case. Marker combinations were
selected separately for User A and User B. Model accuracy results are shown in
Table 5.
Table 5: Multiclass classification accuracy for five classes

Class combinations^a    Test set accuracy (%)
                        User A    User B
A,YN,R,C,N              95.62     95.83
A,R,F,R,N               96.09     96.25
A,F,T,YN,C              96.23     96.47
A,F,C,R,N               95.51     95.19
A,W,D,R,YN              95.72     95.32
YN,R,N,A,W              95.41     95.83
Mean                    95.76     95.82

^a A: Assertion, YN: Yes/no question, R: Relative, F: Focus, T: Topic,
C: Conditional, D: Doubt question, W: Wh question, N: Negative
8.3 Seven-class classification
For this model, various seven-class combinations of the nine available classes
were implemented. The architecture modification consists of seven neurons in the
output layer, activated by the 'softmax' function. Network hyperparameters were
kept the same as in the binary classification case. Marker combinations were
selected separately for User A and User B. Model accuracy results are shown in
Table 6.
Table 6: Multiclass classification accuracy for seven classes

Class combinations^a    Test set accuracy (%)
                        User A    User B
A,YN,R,C,N,D,W          95.13     95.36
A,R,F,R,N,D,W           95.17     95.63
A,F,T,YN,C,R,N          94.74     94.93
A,F,C,R,N,W,YN          94.92     94.83
A,T,D,R,YN,N,C          95.07     95.12
YN,R,N,A,W,T,C          95.10     95.19
Mean                    95.02     95.18

^a A: Assertion, YN: Yes/no question, R: Relative, F: Focus, T: Topic,
C: Conditional, D: Doubt question, W: Wh question, N: Negative
8.4 Nine-class classification
For this model, all nine classes were classified together. The model predicted
the markers with 95.11% accuracy for User A, 94.93% for User B and 95.06% for
Users A and B together.
Table 7: Binary classification test set accuracy for all nine classes, compared
with a generic fully connected network

                      Proposed model (%)              Fully connected network (%)
Class type            User A   User B   Both users    User A   User B   Both users
Affirmative           98.27    98.60    98.37         78.32    77.17    73.83
Conditional           97.92    97.86    97.79         76.42    76.91    76.39
Relative              97.34    97.73    97.49         82.32    80.63    81.59
Negative              98.48    98.34    98.22         80.45    79.71    79.18
Wh question           97.36    97.83    97.51         76.82    75.43    75.71
Yes/no question       98.02    98.49    98.28         74.51    74.38    74.41
Doubt question        98.47    98.36    98.11         78.52    78.64    78.01
Topics                97.71    97.59    97.46         81.98    81.27    80.19
Focus                 98.76    98.25    98.33         75.65    75.17    74.93
Aggregate mean        98.04    98.12    97.95         78.33    77.70    77.14
Overall mean accuracy 98.04                           77.72
9 Results
The accuracy of the proposed model on each marker individually, as a binary
classification task, is shown in Table 7. The table also compares the proposed
model with a generic fully connected network having exactly the same number of
input, hidden and output layer neurons, with all model hyperparameters and the
optimization algorithm kept identical. The overall accuracy is
calculated as the mean of the individual class mean accuracies over the three
variants (refer to Table 7), which comes to 98.04%. As each marker's test set
differs in its number of samples, a better representation of accuracy in terms
of F-score, precision and recall is shown in Table 8. These values are further
compared with Freitas et al. [10]. For the multiclass classification task, the
accuracies of the three models implemented for each user can be seen in Tables
4, 5 and 6.
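Per-marker F-score, precision and recall values such as those in Table 8 can be
computed from test-set predictions with standard tooling, for example (a sketch;
scikit-learn is an assumption, not the paper's tooling, and y_test/y_pred are
placeholders for one marker's labels and predictions):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# y_test: true 0/1 labels, y_pred: model predictions for one marker.
print("F score  :", f1_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
```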
Table 8: Binary classification performance comparison in terms of F-score,
precision and recall

                    Freitas et al. [10]^a           This paper
Class type          F Score  Precision  Recall      F Score  Precision  Recall
Assertion           0.89     0.98       0.90        0.98     0.97       0.98
Conditional         0.68     0.91       0.55        0.94     0.93       0.96
Relative            0.77     0.99       0.67        0.96     0.99       0.96
Negative            0.69     0.67       0.96        0.96     0.94       0.96
Wh question         0.87     0.96       0.81        0.98     0.99       0.96
Yes/no question     0.83     0.98       0.73        0.94     0.96       0.95
Doubt question      0.89     0.87       0.94        0.99     0.97       0.98
Topics              0.90     0.95       0.85        0.98     0.98       0.97
Focus               0.91     0.94       0.89        0.99     0.98       0.98

^a Each F-score, precision and recall value is the maximum taken across the four
method variations in Freitas et al. [10]
10 Discussion
The F-score, precision and recall values obtained are much improved compared
with those of the accepted method in the literature. This validates that a
customized network architecture, as proposed here, is better able to learn the
correlation patterns among the coordinates of different face regions for a
particular expression type. The learning constraints imposed on the hidden layer
neurons by the customized architecture help the model learn these patterns more
accurately than a generic fully connected one does (Table 7). For multiclass
classification, the model performs equally well, and its accuracy remains
roughly constant over the range of different numbers of markers to classify.
Certain considerations remain. The proposed model was tested on a limited
dataset, which leaves it somewhat untested against higher variance in certain
attribute coordinates. The method was also tested on only the two users
available in the dataset. More users would exhibit GFEs with higher variance,
both in face structure and in the manner of expressing a particular GFE. A
larger and more varied dataset could thus help to better train and further
validate the proposed model.
11 Conclusion
The overall accuracy of the proposed method is excellent, such that it can
reliably be used for classifying GFEs captured in the form of video frames. The
model performs
equally well on both the binary and multiclass classification tasks,
demonstrating its ability to distinguish correctly between a GFE and a non-GFE
and between different types of GFEs. Further work on this topic will involve
combining the proposed model with currently accepted methods of hand sign
classification, in order to create a more evolved human sign language
recognition system.
Acknowledgements I would like to specially thank Fernando de Almeida Freitas,
Felipe Venâncio Barbosa and Sarajane Marques Peres for creating the Grammatical
Facial Expressions dataset, and the University of São Paulo for making it
available under public license. I would also like to thank the University of
California, Irvine machine learning repository for hosting and maintaining this
dataset.
References
1. Anjo M D S, Pizzolato E B, Feuerstack S (2012, November) A real-time system to recognize
static gestures of brazilian sign language (libras) alphabet using kinect. In Proceedings of
the 11th Brazilian Symposium on Human Factors in Computing Systems: 59-268. Brazilian
Computer Society.
2. Ari I, Uyar A, Akarun L (2008, October) Facial feature tracking and expression
recognition for sign language. In 23rd International Symposium on Computer and
Information Sciences (ISCIS), 2008: 1-6
3. Bastos I L, Angelo M F, Loula A C (2015, August) Recognition of static gestures applied
to Brazilian sign language (Libras). In 28th SIBGRAPI Conference on Graphics, Patterns
and Images (SIBGRAPI), 2015: 305-312
4. Bedregal B C, Costa A C, Dimuro G P (2006, August) Fuzzy rule-based hand gesture
recognition. In IFIP International Conference on Artificial Intelligence in Theory and Prac-
tice Springer, Boston, MA : 285-294
5. Bridges B, Metzger M (1996) Deaf tend your: Non-manual Signals in American
Sign Language. Calliope Press.
6. Brito L F (1995) Por uma gramática de línguas de sinais. Tempo Brasileiro.
7. de Quadros R M, Karnopp L B (2009) Língua de sinais brasileira: estudos
linguísticos. Artmed Editora.
8. de Souza C R, Pizzolato E B (2013, July) Sign language recognition with support vector
machines and hidden conditional random fields: going from fingerspelling to natural articu-
lated words. In International Workshop on Machine Learning and Data Mining in Pattern
Recognition : 84-98 Springer Berlin Heidelberg
9. Dias D B, Madeo R C, Rocha T, Bíscaro H H, Peres S M (2009, June) Hand
movement recognition for brazilian sign language: a study using distance-based
neural networks. In IJCNN 2009, International Joint Conference on Neural
Networks: 697-704
10. Freitas F A, Peres S M, Lima C A M, Barbosa F V (2014) Grammatical Facial Expressions
Recognition with Machine Learning. In: 27th Florida Artificial Intelligence Research Society
Conference (FLAIRS), 2014, Pensacola Beach. Proceedings of the 27th Florida Artificial
Intelligence Research Society Conference (FLAIRS). Palo Alto: The AAAI Press: 180-185
11. Glorot X, Bengio Y (2010, March) Understanding the difficulty of training deep feedfor-
ward neural networks. In Proceedings of the Thirteenth International Conference on Artificial
Intelligence and Statistics :249-256
12. Kelly D, Reilly Delannoy J, Mc Donald J, Markham C (2009, November) A framework for
continuous multimodal sign language recognition. In Proceedings of the 2009 international
conference on Multimodal interfaces: 351-358
13. Kingma D, Ba J (2014) Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980.
14. Krňoul Z, Hrúz M, Campr P (2010, October) Correlation analysis of facial
features and sign gestures. In IEEE 10th International Conference on Signal
Processing (ICSP), 2010: 732-735
15. Lichman M (2013) UCI Machine Learning Repository [http://archive.ics.uci.edu/ml].
Irvine, CA: University of California, School of Information and Computer Science.
16. Pistori H, Neto J (2004) An experiment on handshape sign recognition using adaptive
technology: Preliminary results. Advances in Artificial Intelligence SBIA 2004: 763-801
17. Pizzolato E B, dos Santos Anjo M, Pedroso G C (2010, March) Automatic recognition of
finger spelling for libras based on a two-layer architecture. In Proceedings of the 2010 ACM
Symposium on Applied Computing: 969-973
18. Porfirio A J, Wiggers K L, Oliveira L E, Weingaertner D (2013, October) LIBRAS sign
language hand configuration recognition based on 3D meshes. In IEEE International Con-
ference on Systems, Man, and Cybernetics (SMC), 2013: 1588-1593
19. Von Agris U, Knorr M, Kraiss K F (2008, September) The significance of facial features for
automatic sign language recognition. In 8th IEEE International Conference on Automatic
Face & Gesture Recognition, 2008: 1-6